Module 3
University of South Florida
Unsupervised learning: clustering, association, dimensionality reduction
In finance: clustering & dimensionality reduction
K-means clustering
Hierarchical clustering
Principal Component Analysis (PCA)
Banking
Suppose you are a bank and have hundreds of thousands of customers and more than 100 features describing each
Unsupervised learning algorithms can be used to divide your customers into clusters
To anticipate their needs
Communicate more effectively
Banking
Suppose you are a bank and have hundreds of thousands of customers and more than 100 features describing each
Or, you can reduce the features needed from 100 to 15 features
Removes redundancy and improves efficiency for further analysis
Feature scaling is a technique to standardize the range of independent variables (X, or features).
Common methods are:
Not all algorithms require feature scaling, but some are especially susceptible to scale effects.
Issues caused by improper scaling:
\[ Value \rightarrow \frac{Value - Mean}{SD}\]
\[ Value \rightarrow \frac{Value - Minimum}{Maximum - Minimum}\]
You are given the following vector of daily returns (in decimal form) for a stock:
Q1. Scale using Z-score scaling.
\[ Value \rightarrow \frac{Value - Mean}{SD}\]
Q2. Scale using Min-Max scaling.
\[ Value \rightarrow \frac{Value - Minimum}{Maximum - Minimum}\]
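Both scalings are one-liners in R. The returns vector below is a hypothetical stand-in, since the exercise's actual vector is not reproduced here:

```r
# Hypothetical daily returns (decimal form) -- illustrative only
returns <- c(0.01, -0.02, 0.03, 0.00, -0.01)

# Z-score scaling: (value - mean) / sd  -> mean 0, sd 1
z_scaled <- (returns - mean(returns)) / sd(returns)

# Min-max scaling: (value - min) / (max - min)  -> range [0, 1]
mm_scaled <- (returns - min(returns)) / (max(returns) - min(returns))

round(z_scaled, 3)
round(mm_scaled, 3)
```

Note that z-scores are unbounded, while min-max always maps the smallest value to 0 and the largest to 1.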
Calculate straight line distance between two points
Euclidean Distance
Consider two points in a two-dimensional space:
A(2, 3) and B(7, 11)
Calculate the Euclidean distance d(A,B) between these two points using the formula:
\[ d(A,B) = \sqrt{(x_B - x_A)^2 + (y_B - y_A)^2} \]
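A quick check of the formula in R:

```r
# Euclidean distance between A(2, 3) and B(7, 11)
A <- c(2, 3)
B <- c(7, 11)
d_AB <- sqrt(sum((B - A)^2))  # sqrt(5^2 + 8^2) = sqrt(89)
d_AB  # ~ 9.434
```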
The centroid of a cluster is the average of all points in that cluster
In 2D, it is the mean of all \(x\) and \(y\) coordinates: \[ \textrm{Centroid} = \left( \frac{1}{n} \sum_{i=1}^n x_i,\ \frac{1}{n} \sum_{i=1}^n y_i \right) \]
Acts as the center or representative point of the cluster
Centroid
Consider two points in a two-dimensional space:
A(2, 3) and B(7, 11)
Calculate the centroid.
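In R, `colMeans()` gives the centroid directly, averaging each coordinate:

```r
# Centroid of A(2, 3) and B(7, 11): mean of the x's and mean of the y's
pts <- rbind(c(2, 3), c(7, 11))
centroid <- colMeans(pts)
centroid  # (4.5, 7)
```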
The Within-Cluster Sum of Squares (WSS) for cluster \(j\) is:
\[ WSS_j = \sum\limits_{i=1}^{n} d_i^2 \]
where \(d_i\) is the distance from point \(i\) to the cluster's centroid
Within Cluster Sum of Squares (WSS)
A cluster has three data points:
P_1(1, 2), P_2(2, 4), and P_3(3, 6).
The centroid of this cluster is given as C(2, 4).
Calculate the WSS for this cluster, defined as:
\[
WSS = \sum_{i=1}^{n} d(P_i, C)^2
\]
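The computation can be checked in R:

```r
# Three points and the given centroid C(2, 4)
P <- rbind(c(1, 2), c(2, 4), c(3, 6))
C <- c(2, 4)

# Squared Euclidean distance of each point to the centroid
sq_dist <- rowSums(sweep(P, 2, C)^2)  # 5, 0, 5
WSS <- sum(sq_dist)
WSS  # 10
```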
Inertia is the total within-cluster sum of squares (WSS), i.e., the sum of squared distances to the centroids:
\[ \mathrm{Inertia} = \sum\limits_{j=1}^K\mathrm{WSS}_j \]
Used to evaluate the quality of clusters formed by K-means
Acts like an error metric for K-means (we try to minimize it)
Inertia
Suppose you have three clusters with the following within-cluster sums of squares (WSS):
How do you calculate the inertia for three clusters?
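Inertia is simply the sum of the per-cluster WSS values. The numbers below are hypothetical, since the slide's WSS values are not reproduced here:

```r
# Hypothetical WSS values for three clusters (illustrative only)
wss <- c(12.5, 8.3, 15.2)
inertia <- sum(wss)  # inertia = total WSS across all clusters
inertia  # 36
```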
A popular clustering algorithm in Finance.
A partitioning method that divides the dataset into K distinct subsets
Algorithm:
Goal: find K cluster centroids that minimize distances within cluster data points
Choose the number of clusters, K.
Randomly select K data points (without replacement) to serve as the initial centroids.
Input:
\(X = \{x_1, x_2, …, x_n\}\): a set of data points (\(n\) total observations)
\(K\) : number of clusters
Output: \(K\) cluster centroids \(\mu_1, …, \mu_K\) and cluster assignments \(C_1, …, C_K\)
Algorithm:
Initialize
Repeat until convergence
For each \(x_i \in X\): find the nearest centroid, \(j^* = \operatorname{argmin}_{j\in\{1,…,K\}} \ d(x_i, \mu_j)\)
Assign \(x_i\) to cluster \(C_{j^*}\)
For each cluster \(C_j\) : update centroid, \(\mu_j = \frac{1}{|C_j|}\sum\limits_{x_i\in C_j} x_i\)
Termination
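The loop above can be sketched from scratch in base R. This is a minimal illustration (the function name `my_kmeans` is made up; the lab uses `h2o.kmeans`), and it does not handle the empty-cluster edge case:

```r
# Minimal Lloyd's-algorithm sketch (illustrative only).
# X: numeric matrix of observations; K: number of clusters.
my_kmeans <- function(X, K, max_iter = 100) {
  # Initialize: randomly pick K distinct data points as centroids
  centroids <- X[sample(nrow(X), K), , drop = FALSE]
  assign <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Assignment step: each point goes to its nearest centroid
    new_assign <- apply(X, 1, function(x)
      which.min(colSums((t(centroids) - x)^2)))
    if (all(new_assign == assign)) break  # converged: assignments stable
    assign <- new_assign
    # Update step: centroid = mean of the points assigned to it
    for (j in seq_len(K))
      centroids[j, ] <- colMeans(X[assign == j, , drop = FALSE])
  }
  list(cluster = assign, centers = centroids)
}

# Two well-separated blobs should be recovered with K = 2
set.seed(123)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
fit <- my_kmeans(X, K = 2)
table(fit$cluster)
```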
How many Ks? Choosing the appropriate number of clusters (K) is crucial:
Then, how can we choose K?
Heuristics: if a strong prior exists (e.g., theory, regulations)
With statistical methods:
Calculate inertia (i.e., total within-cluster sum of squares, WSS) for K from small to large.
Silhouette score measures how close each point \(i\) in one cluster is to the points in neighboring clusters.
\[ S_i = \frac{b_i - a_i}{\max{[a_i, b_i]}} \]
- $a_i$: average distance from point $i$ to the other points in its own cluster
- $b_i$: average distance from point $i$ to the points of each of the other $K-1$ clusters; take the minimum
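A toy computation for a single point, with hypothetical distances:

```r
# Hypothetical silhouette inputs for one point i
a_i <- 1.0  # average distance to points in its own cluster
b_i <- 3.0  # smallest average distance to another cluster
S_i <- (b_i - a_i) / max(a_i, b_i)
S_i  # ~ 0.667: well inside its cluster (range is -1 to 1)
```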
Compares inertia for different values of K with random clustering.
Idea: the compactness of clustering should be better than that of random clustering
\[ Gap(k) = m_k - w_k \]
where \(w_k\) is the (log) within-cluster dispersion of the observed data and \(m_k\) is its expected value under random (reference) clustering
Choose \(k\) where
\[ Gap(k) \ge Gap(k+1) - s_{k+1} \]
The concept of “distance” becomes less meaningful in high dimensions
1,000 randomized samples with dimensions varying from 1 to 500
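This thinning of distance contrast is easy to simulate. In the sketch below (sample sizes are illustrative), the relative gap between the nearest and farthest pairwise distances is large in 1 dimension and nearly vanishes in 500:

```r
# Relative contrast (max - min)/min of pairwise distances shrinks
# as the dimension grows ("curse of dimensionality").
set.seed(42)
contrast <- function(d, n = 100) {
  X <- matrix(runif(n * d), ncol = d)  # n random points in [0, 1]^d
  dists <- as.vector(dist(X))          # all pairwise Euclidean distances
  (max(dists) - min(dists)) / min(dists)
}
res <- c(dim1 = contrast(1), dim500 = contrast(500))
res  # contrast is far larger in 1 dimension than in 500
```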
K-means is sensitive to initialization (seed selection)
Though K-means is unsupervised, inertia can be used as an error metric, as in supervised learning. (h2o provides this feature.)
Open-source, in-memory platform for distributed, scalable machine learning
Integrates with big data infrastructure (e.g., Hadoop and Spark)
Provides a broad set of ML algorithms
AutoML: automatic model training
High-Performance: Written in Java, robust and fast for large datasets
Multi-Language Support: Accessible via R, Python, Java, and Scala
Web API & GUI: Offers an interactive web interface for model monitoring and management
H2O’s R API syntax mostly follows base R.
Most dplyr verbs don’t work directly on an H2OFrame.
In the lab walkthrough, we will use h2o to perform K-Means clustering. The workflow includes:
Initializing H2O
Importing or converting data into H2O frames
Training ML models using h2o algorithms
Evaluating and interpreting results
Exporting predictions or models
# A tibble: 4 × 6
Country Abbrev Corruption Peace Legal `GDP Growth`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Albania AL 35 1.82 4.55 2.98
2 Algeria DZ 35 2.22 4.43 2.55
3 Argentina AR 45 1.99 5.09 -3.06
4 Armenia AM 42 2.29 4.81 6
# A tibble: 4 × 6
country abbrev corruption peace legal gdp_growth
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Albania AL 35 1.82 4.55 2.98
2 Algeria DZ 35 2.22 4.43 2.55
3 Argentina AR 45 1.99 5.09 -3.06
4 Armenia AM 42 2.29 4.81 6
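The renaming between the two tables above (e.g., `GDP Growth` → `gdp_growth`) is typically done with `janitor::clean_names()`; a base-R equivalent is shown below on a one-row toy data frame mirroring the first table:

```r
# Rebuild a one-row version of the raw table; check.names = FALSE
# preserves the space in "GDP Growth".
df <- data.frame(Country = "Albania", Abbrev = "AL", Corruption = 35,
                 Peace = 1.82, Legal = 4.55, `GDP Growth` = 2.98,
                 check.names = FALSE)
# Snake-case the column names: lowercase, spaces -> underscores
names(df) <- tolower(gsub(" ", "_", names(df)))
names(df)
```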
Evaluating country-level risk is important in international finance:
The challenge is, there are so many nations!
Pairwise plots to explore variable relationships:
Initiate an H2O Java server locally with h2o.init().
Tip
If there is an external server already running, you can instead connect to it with h2o.init(). Check the documentation.
Since the H2O server is running separately, you need to upload the data to the server.
If you have a data.frame in R, send the data to H2O with as.h2o()
Let’s dry-run a K-means algorithm with K = 3. We’ll then expand the analysis with the elbow method to determine the optimal K.
Suppose we want K = 3 clusters. We generate cluster assignments with the fitted K-means model.
The elbow method helps us choose a good value for K (the number of clusters) by:
Trying several values of K (e.g., K = 1 to 20).
Measuring how well the clusters fit the data (using the Total Within-Cluster Sum of Squares, or WSS).
Plotting WSS vs K.
Looking for an “elbow” — the point where adding more clusters stops giving big improvements.
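The loop above can be sketched with base R's `kmeans()` as a stand-in for `h2o.kmeans()` (the country data isn't reproduced here, so scaled `mtcars` serves as example data; `tot.withinss` plays the role of inertia):

```r
# Elbow-method sketch: total WSS for K = 1..10
set.seed(123)
X <- scale(mtcars)  # standardize all numeric columns first
ks <- 1:10
wss <- sapply(ks, function(k)
  kmeans(X, centers = k, nstart = 20)$tot.withinss)
# Look for the "elbow" where the curve flattens out
plot(ks, wss, type = "b", xlab = "K", ylab = "Total WSS (inertia)")
```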
Let’s use K = 6 for our final model.
# A tibble: 6 × 7
country abbrev corruption peace legal gdp_growth cluster
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
1 Albania AL 35 1.82 4.55 2.98 0
2 Algeria DZ 35 2.22 4.43 2.55 4
3 Argentina AR 45 1.99 5.09 -3.06 4
4 Armenia AM 42 2.29 4.81 6 0
5 Australia AU 77 1.42 8.36 1.71 3
6 Austria AT 77 1.29 8.09 1.60 3
Let’s skim the first 2 observations of each group.
# A tibble: 12 × 7
# Groups: cluster [6]
country abbrev corruption peace legal gdp_growth cluster
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
1 Albania AL 35 1.82 4.55 2.98 0
2 Armenia AM 42 2.29 4.81 6 0
3 Iran IR 26 2.54 4.58 -9.46 1
4 Nicaragua NI 22 2.31 4.34 -5.04 1
5 Burundi BI 19 2.52 3.80 0.419 2
6 Cameroon CM 25 2.54 4.31 4.00 2
7 Australia AU 77 1.42 8.36 1.71 3
8 Austria AT 77 1.29 8.09 1.60 3
9 Algeria DZ 35 2.22 4.43 2.55 4
10 Argentina AR 45 1.99 5.09 -3.06 4
11 Botswana BW 61 1.68 5.96 3.48 5
12 Chile CL 67 1.63 6.88 2.52 5
Visualize summary statistics on each cluster group. How do you interpret the grouping results?
# A tibble: 6 × 5
cluster corruption peace legal gdp_growth
<int> <dbl> <dbl> <dbl> <dbl>
1 0 37.7 2.01 5.00 5.32
2 1 24 2.44 4.22 -7.19
3 2 26.6 2.84 4.28 2.50
4 3 80.5 1.42 8.18 1.48
5 4 37.2 2.12 5.18 1.54
6 5 59.6 1.75 6.65 2.43
country_risk |>
group_by(cluster) |>
summarize(across(!c(country, abbrev), mean)) |>
pivot_longer(-cluster) |>
ggplot(aes(x = as.factor(cluster), y = value, fill = name)) +
geom_col(position = "dodge", width = 0.5) +
labs(
x = "Group",
y = "Index Value",
fill = "Category",
title = "Cluster Profiles (K=6)"
) +
theme_bw()

Silhouette score can be computed using cluster::silhouette(). It requires cluster assignments and a distance matrix.
library(cluster)
get_silhouette <- function(k) {
# Step 1: Fit k-means using H2O
km_model <- h2o.kmeans(
training_frame = country_risk_h2o,
x = c("corruption", "peace", "legal", "gdp_growth"),
k = k,
standardize = TRUE,
seed = 123
)
# Step 2: Get cluster assignments as a vector
clusters <- as.vector(h2o.predict(km_model, country_risk_h2o))
# Step 3: Compute distance matrix
dist_matrix <- country_risk |>
select(where(is.numeric)) |>
dist()
# Step 4: Compute silhouette scores
sil <- silhouette(clusters, dist_matrix)
# Step 5: Return average silhouette score for k
return(mean(sil[, 3]))
}
# Create summary table
silhouette_summary <- tibble(
K = 2:20,
Avg_sil = map_dbl(2:20, get_silhouette)
)

The mtcars data is available in R. Load it with data(mtcars).
Use all numeric variables for clustering.
Perform elbow (and silhouette) method to determine best k.
With your choice of k, describe the clusters in detail including visualizations.
Report in Quarto .html.
John C. Hull “Machine Learning in Business”
FIN6776: Big Data and Machine Learning in Finance